The Portiloop: A deep learning-based open science tool for closed-loop brain stimulation

Closed-loop brain stimulation refers to capturing neurophysiological measures such as electroencephalography (EEG), quickly identifying neural events of interest, and producing auditory, magnetic or electrical stimulation so as to interact with brain processes precisely. It is a promising new method for fundamental neuroscience and perhaps for clinical applications such as restoring degraded memory function; however, existing tools are expensive, cumbersome, and offer limited experimental flexibility. In this article, we propose the Portiloop, a deep learning-based, portable and low-cost closed-loop stimulation system able to target specific brain oscillations. We first document open-hardware implementations that can be constructed from commercially available components. We also provide a fast, lightweight neural network model and an exploration algorithm that automatically optimizes the model hyperparameters to the desired brain oscillation. Finally, we validate the technology on a challenging test case of real-time sleep spindle detection, with results comparable to off-line expert performance on the Massive Online Data Annotation spindle dataset (MODA; group consensus). Software and plans are available to the community as an open science initiative to encourage further development and advance closed-loop neuroscience research [https://github.com/Portiloop].

[Figure: windows indexed -19 to -1. Score: 0.136.]
The integrated gradients algorithm makes it possible to explore why the model makes a given decision (the redder a portion of the signal, the higher its influence on the current output of the ANN). Grey windows are past inputs, whereas the black window is the current input: past inputs influence the current output through the RNN. Here, the model finds that it is looking at the aftermath of a sleep spindle. With our time dilation and window size, a small portion of the window overlaps from one sample to the next. We see that this portion (at the left-hand side of each window) is in fact ignored by the model. Therefore, it is probably possible to shrink our model even more, although PMBO did not find this. In the future, this type of visualization might also help experts better understand what sleep spindles are by revealing unknown influences.

APPENDIX C PMBO HYPERPARAMETERS
• noise type 1: portion of the m sampled models that are sampled randomly in the whole hyperparameter space, instead of in a Gaussian around the last completed experiment.
• noise type 2: portion of the time when a model is sampled randomly in the m models, instead of being selected by its Pareto efficiency.
The ANN used for our meta model is a simple Multi-Layer Perceptron (MLP) of 3 fully connected layers. The hyperparameters we use in PMBO are summarized in Table I.

APPENDIX I ONLINE STANDARDIZATION THROUGH EXPONENTIAL MOVING AVERAGE
We standardize the signal on the fly through an exponential moving average. In other words, we transform the filtered signal $s(t)$ into $s'(t)$ according to:
$$ s'(t) = \frac{s(t) - \hat{\mu}}{\hat{\sigma}} $$
where $\hat{\mu}$ and $\hat{\sigma}$ are estimates of the mean and standard deviation of the filtered signal, computed as exponential moving averages whose smoothing factors $\alpha_\mu$ and $\alpha_\sigma$ are hyperparameters in [0, 1]. This custom real-time standardization makes the signal comparable from one subject to another, enabling generalizable learning.
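As an illustration, the running standardization can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Portiloop implementation: the class name, the default smoothing factors, the initial estimates, the small `eps` guard, and the exact form of the moving-average updates (new estimate = (1 - α) · old estimate + α · observation) are all hypothetical choices.

```python
import math


class OnlineStandardizer:
    """Illustrative running standardization with exponential moving averages."""

    def __init__(self, alpha_mu=0.1, alpha_sigma=0.001, eps=1e-8):
        # Hypothetical defaults; both smoothing factors must lie in [0, 1].
        self.alpha_mu = alpha_mu
        self.alpha_sigma = alpha_sigma
        self.eps = eps      # guards against division by zero (assumption)
        self.mean = 0.0     # running estimate of the mean
        self.var = 1.0      # running estimate of the variance

    def step(self, s):
        """Consume one filtered sample and return its standardized value."""
        # Assumed update form: (1 - alpha) * old + alpha * observation.
        self.mean = (1.0 - self.alpha_mu) * self.mean + self.alpha_mu * s
        deviation_sq = (s - self.mean) ** 2
        self.var = (1.0 - self.alpha_sigma) * self.var + self.alpha_sigma * deviation_sq
        return (s - self.mean) / (math.sqrt(self.var) + self.eps)
```

Each incoming sample is passed through `step`, so the estimates adapt continuously to the subject and recording conditions without requiring a calibration pass over the whole recording.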

APPENDIX J PARALLEL MODEL-BASED OPTIMIZATION
When developing a novel Portiloop application, the practitioner needs to devise a neural network that is both high-performance and lightweight, by selecting the right set of hyperparameters $H$ in the space $\mathcal{H}$ of all possible hyperparameter sets. Such hyperparameters include the size of the sliding window, the number of layers in each part of the ANN, the width of each layer, the time dilation (see Section K), the type of optimizer, the hyperparameters of the optimizer itself, etc. $\mathcal{H}$ can be very large, and finding a set of hyperparameters that yields a high-performance model within given hardware constraints is far from trivial. We introduce "Parallel Model-Based Optimization" (PMBO), a network-based algorithm that automates this process in a parallel fashion. Released as open source along with our code, PMBO is essentially a parallelized and evolved version of "Probabilistic SMBO" [1]. PMBO is a guided-search approach that rapidly finds a suitable set of hyperparameters. To this end, it uses one machine (or process) to train a meta network whose role is to predict, for any given set of hyperparameters, costs that are not trivially available. Furthermore, PMBO uses any other available machines (or processes) in parallel to train ANNs from sets of hyperparameters selected based on the cost estimated by the meta model.
The purpose of PMBO is to find Pareto-optimal sets of hyperparameters that minimize both a software cost and a hardware cost. Hyperparameter selection is a bi-objective problem in our setting: on one hand, we want an ANN that performs well at detecting the desired patterns. We measure this performance in terms of the f1-score of our model. The f1-score depends both on the precision (how sure we are that positive outputs are true positives) and on the recall (how sure we are that we capture all true positives) of the model. It is defined as:
$$ \mathrm{f1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $$
where
$$ \mathrm{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \quad \text{and} \quad \mathrm{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \tag{8} $$
The f1-score is in [0, 1], with 1 being a perfect classifier. We cast this into a minimization problem by defining our software cost as $L_s = 1 - \mathrm{f1}$.
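For concreteness, the software cost can be computed from raw detection counts as in the sketch below. The function names are hypothetical, and returning 0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def f1_score(true_pos, false_pos, false_neg):
    """f1-score from detection counts; returns 0.0 when undefined (assumed convention)."""
    predicted = true_pos + false_pos   # everything the model flagged as positive
    actual = true_pos + false_neg      # everything that really is positive
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def software_cost(true_pos, false_pos, false_neg):
    """Software cost L_s = 1 - f1, so that lower is better."""
    return 1.0 - f1_score(true_pos, false_pos, false_neg)
```

For example, a model with 8 true positives, 2 false positives and 2 false negatives has a precision and recall of 0.8 each, hence an f1-score of 0.8 and a software cost of 0.2.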
On the other hand, we want our ANN to be as lightweight as possible, so it fits in the limited memory of the Portiloop and executes as quickly as possible. A precise measurement of the execution duration and amount of memory needed on the board is difficult. In fact, we do not have access to these results until the model is actually synthesized on the board, which is a lengthy process. Thus, we use the number of trainable weights in the ANN as a proxy for these concerns, and call this number our hardware cost $L_h$. Note that our choice of costs is arbitrary and other custom costs can be used instead.
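To make the proxy concrete, the sketch below counts the trainable weights of a plain fully connected network. It is only an illustration: the function name and layer sizes are hypothetical, and recurrent units such as GRUs follow a different counting formula that is not covered here.

```python
def mlp_trainable_weights(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first and output last,
    e.g. [4, 8, 8, 1]. Each layer contributes n_in * n_out weights plus
    n_out biases.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
```

With the hypothetical sizes `[4, 8, 8, 1]`, the count is 5·8 + 9·8 + 9·1 = 121. Such a number is available immediately, without training the model or synthesizing it on the board.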
Let us consider a meta dataset $\mathcal{E}$ of previously completed experiments (i.e., tuples $E = (H, L_s, L_h)$ of hyperparameter sets with their real costs). Let us also consider the Pareto-dominance relation between experiments (NB: in the context of this minimization problem, dominating means having the smallest costs). For a given experiment $E$ and meta dataset $\mathcal{E}$, we denote $D_{\mathcal{E}}(E)$ as the number of experiments in $\mathcal{E}$ that are Pareto-dominated by $E$, and $d_{\mathcal{E}}(E)$ as the number of experiments in $\mathcal{E}$ that Pareto-dominate $E$. For a given set of hyperparameters $H \in \mathcal{H}$, we further define an estimate of the corresponding completed experiment $E$ as $\hat{E} = (H, \hat{L}_s, \hat{L}_h)$, where $\hat{L}_s$ and $\hat{L}_h$ are estimates of the real costs, computed by the meta network. Note that in our setting, $\hat{L}_h = L_h$ is available and thus only $\hat{L}_s$ is estimated by the meta network. We propose the heuristic Pareto efficiency $\eta(\hat{E})$, which combines four terms:
• $a_{\mathcal{E}}(\hat{E})$ promotes hyperparameter sets whose predicted costs are not dominated by many experiments in $\mathcal{E}$.
• $b_{\mathcal{E}}(\hat{E})$ promotes hyperparameter sets whose predicted costs dominate many experiments in $\mathcal{E}$.
• $s_{\mathcal{E}}(\hat{E})$ promotes hyperparameter sets whose predicted software costs are better than the best software cost amongst all completed experiments in $\mathcal{E}$.
• $h_{\mathcal{E}}(\hat{E})$ penalizes hyperparameter sets that have a high density with respect to experiments present in $\mathcal{E}$ in terms of their hardware cost.
More precisely, for the last term, we define a range of hardware costs we are interested in, and we split this range into a number of bins. We then compute the binned density of experiments in $\mathcal{E}$ over this range, and multiply this density by the range's width. The penalty $h_{\mathcal{E}}(\hat{E})$ is the height of the resulting bin in which $\hat{L}_h$ falls. Multiplying the density by the range's width enforces $h_{\mathcal{E}}(\hat{E}) > 1$ in regions of high density and $h_{\mathcal{E}}(\hat{E}) < 1$ in regions of low density. Fig 12 explains our PMBO algorithm.
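The two dominance counts and the density penalty can be sketched as follows. This is an illustration under assumptions rather than the released PMBO code: all function names are hypothetical, experiments are reduced to `(L_s, L_h)` cost pairs, and dominance is taken in the usual sense (no worse on both costs and strictly better on at least one). The way the four terms are combined into the efficiency is not reproduced here.

```python
def dominates(a, b):
    """True if cost pair a = (L_s, L_h) Pareto-dominates b (assumed definition)."""
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better


def count_dominated(e, experiments):
    """Number of experiments that are Pareto-dominated by e."""
    return sum(dominates(e, other) for other in experiments)


def count_dominating(e, experiments):
    """Number of experiments that Pareto-dominate e."""
    return sum(dominates(other, e) for other in experiments)


def density_penalty(l_h, hardware_costs, lo, hi, n_bins):
    """Binned density of hardware costs over [lo, hi], times the range width,
    read at the bin containing l_h (l_h is assumed to lie within [lo, hi])."""
    bin_width = (hi - lo) / n_bins
    counts = [0] * n_bins
    in_range = 0
    for cost in hardware_costs:
        if lo <= cost <= hi:
            # The upper bound hi is folded into the last bin.
            counts[min(int((cost - lo) / bin_width), n_bins - 1)] += 1
            in_range += 1
    if in_range == 0:
        return 0.0
    index = min(int((l_h - lo) / bin_width), n_bins - 1)
    density = counts[index] / (in_range * bin_width)
    return density * (hi - lo)
```

If the completed experiments were spread uniformly over the bins, every bin would have a height of 1, which is why values above 1 indicate crowded regions of hardware cost and values below 1 indicate under-explored ones.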
A central meta learner communicates with $n$ peripheral workers to find a hyperparameter set $H^* \in \mathcal{H}$ that is Pareto-optimal for both the software and hardware costs (i.e., not Pareto-dominated by any other set). To this end, the algorithm uses a meta dataset $\mathcal{E}$ of tuples $E_i = (H_i, L_s^i, L_h^i)$ to train a meta network that maps any hyperparameter set $H \in \mathcal{H}$ to its corresponding (estimated) costs $\hat{L}_s$ and $\hat{L}_h$¹. Once the meta network is trained, it is used to guide the sampling process. More exactly, we sample $m$ hyperparameter sets in $\mathcal{H}$ from a multivariate Gaussian distribution around the hyperparameter set corresponding to the last results received from the workers by the meta learner. Sets that do not satisfy user-defined constraints (e.g., that have already been tested, or, when $L_h$ is available, that fall outside the range we are interested in) are discarded and resampled. We then select the best set in terms of Pareto efficiency, estimated using the trained meta network. This selected set is appended to a buffer, waiting to be consumed by an idle worker. The sampling process is repeated until the buffer is full. When a worker is idle, it fetches an ANN architecture and training instructions from the buffer. The worker yields a measurement of the real hardware and software costs for the current hyperparameter set, which are sent back to the meta learner and appended to the meta dataset. The meta learner then uses the updated meta dataset to train a new meta network, and so on.

¹ $\hat{L}_h$ is an output of the meta network in the general case. However, with our choice of hardware cost it is not, since the ground truth $L_h$ is available.

The algorithm serves to automate the selection of hyperparameters within given hardware constraints. It is based on a standard single-producer and multiple-consumers architecture. A "meta learner" is in charge of producing relevant hyperparameter sets in a guided fashion. It sends these sets to "workers", and keeps producing new sets as long as idle workers remain. Each worker that has received a new set starts training an ANN corresponding to the assigned set. Once this training ends, the real costs of the hyperparameter set can be computed and are sent back to the meta learner. The best-performing models can then be selected for implementation.
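The guided sampling step performed by the meta learner can be sketched as below. This is a simplified illustration, not the released PMBO code: the function name, its parameters and their defaults are hypothetical, hyperparameters are reduced to a list of bounded real values, the multivariate Gaussian is assumed to have a diagonal covariance scaled by each range, and the Pareto efficiency is passed in as a callable (in PMBO it would rely on the trained meta network). The two noise parameters play the role of the noise types listed in Appendix C.

```python
import random


def propose(last_h, bounds, efficiency, m=20, sigma=0.1,
            noise_1=0.1, noise_2=0.1, is_valid=lambda h: True, rng=random):
    """Return one hyperparameter set to append to the buffer.

    last_h:     hyperparameter set of the last completed experiment
    bounds:     list of (low, high) pairs, one per hyperparameter
    efficiency: callable scoring a candidate (higher is better)
    is_valid:   user-defined constraints; it must accept some candidates,
                otherwise the resampling loop never terminates
    """
    candidates = []
    while len(candidates) < m:
        if rng.random() < noise_1:
            # Noise type 1: sample anywhere in the hyperparameter space.
            h = [rng.uniform(lo, hi) for lo, hi in bounds]
        else:
            # Otherwise: Gaussian around the last completed experiment.
            h = [rng.gauss(x, sigma * (hi - lo))
                 for x, (lo, hi) in zip(last_h, bounds)]
        in_bounds = all(lo <= x <= hi for x, (lo, hi) in zip(h, bounds))
        if in_bounds and is_valid(h):
            candidates.append(h)   # rejected candidates are simply resampled
    if rng.random() < noise_2:
        # Noise type 2: pick a candidate at random instead of the best one.
        return rng.choice(candidates)
    return max(candidates, key=efficiency)
```

In a full implementation, the meta learner would call such a function repeatedly until the buffer is full, and retrain the meta network whenever a worker returns new costs.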

APPENDIX K VIRTUAL ANN PARALLELIZATION WITH TIME DILATION
Given the Portiloop's design constraints, we sought a lightweight means of allowing our resource-restricted network to use as much signal history as possible (as do larger neural networks). Time dilation [2] is a technique that enables recurrent units such as Gated Recurrent Units (GRUs) to look further back in time before gradients vanish, at no computational cost. We propose a version of this technique that allows us to virtually parallelize a single physical ANN into several decoupled virtual models. Our approach enables shallow recurrent neural networks to look further back in time by skipping the redundant information that is inherent to the use of a sliding window as input, while still acting as fast as possible. Fig 13 (a) illustrates how time dilation can be used to look further back in time and avoid redundancy.
Although time dilation enables reaching further back in time at no extra computational cost, this comes with a cost in terms of delays. Since our technique causes samples to be skipped between forward passes in the ANN, a detection delay that can be as long as the time dilation is introduced. We correct this issue by implementing a trick that we call virtual parallelization. We create a First In First Out (FIFO) list as large as the time dilation, and fill this list with independent hidden states². At each time-step, we pop a hidden state from this list, feed it to the recurrent units of our ANN, perform a forward pass, and append the resulting hidden state to the list. Doing this without skipping samples is equivalent to having several decoupled models running in parallel as illustrated in Fig 13 (b), although one single ANN is physically used. This trick allows us to keep acting as fast as possible since it removes the need for skipping samples, while still reaching far back in time at no extra computational cost.

Fig. 13. A lightweight solution to using historical signal information. To "look" farther back in time, which facilitates detection accuracy, we introduce (a) Time dilation, and (b) Virtual parallelization. In (a), a sliding window of the 4 last samples (dotted curves) is used as input to the model. The time dilation (arrows) is the number of samples between two forward passes in the ANN. When it is small (top), two consecutive windows overlap (see e.g., green and red windows), meaning consecutive passes contain redundant information. When the time dilation is large (bottom), this issue is corrected, and back-propagation will reach much further back in time for the same number of forward passes. Virtual parallelization reduces delay. In (b), the time dilation is 2, so we keep track of 2 independent hidden states and feed these alternately to the ANN. This trick removes the delay of time dilation.
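The FIFO mechanism behind virtual parallelization can be sketched as follows. This is a minimal illustration, not the Portiloop code: the function name is hypothetical, and `step` stands in for one forward pass of the ANN, taking the current input window and a hidden state and returning an output and the updated hidden state.

```python
from collections import deque


def run_virtual_parallel(inputs, step, initial_state, time_dilation):
    """Run one physical model as `time_dilation` decoupled virtual models.

    step(x, hidden) -> (output, new_hidden) is a placeholder for a forward
    pass. The initial state is shared by reference, so `step` is assumed to
    return a new hidden state rather than modify its argument in place.
    """
    states = deque([initial_state] * time_dilation)  # FIFO of hidden states
    outputs = []
    for x in inputs:
        hidden = states.popleft()      # state last updated time_dilation steps ago
        y, hidden = step(x, hidden)    # one forward pass at every time step
        states.append(hidden)          # goes back to the end of the queue
        outputs.append(y)
    return outputs
```

With a time dilation of 2 and a toy `step` that simply records the inputs it has seen, the hidden state used at time step 4 has seen inputs 0, 2 and 4, whereas the one used at time step 5 has seen inputs 1, 3 and 5: two decoupled virtual models alternate, yet an output is produced at every time step.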